DAY[22] - Kaggle in Practice: Imputing Missing Values and Adding Features (2)

11th iThome Ironman Contest

DAY 22

AI & Data

Introduction to Python Machine Learning and Hands-On Practice series, part 22

11th Ironman Contest python3 machine learning

Austin

Team Bikini Bottom

2019-10-07 21:06:19

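Continuing the imputation from the previous post: for certain columns, the meaning of the column itself tells us which value to fill in. For example, when a garage attribute is missing, the most likely reason is that the house has no garage, so the related numeric fields such as area can be filled with 0. For non-numeric fields, check how the original data explicitly encodes "not present" and use that label; here we fill in the string 'None'.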

for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    features[col] = features[col].fillna(0)

for col in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
    features[col] = features[col].fillna('None')
# Basement columns follow the same logic as the garage: a missing value means no basement
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    features[col] = features[col].fillna('None')
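
The rule above can be tried on a small table first. This is a minimal sketch, assuming a made-up three-row DataFrame named toy (not part of the Kaggle data) in which the second house has no garage:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second house has no garage,
# so every garage field in that row is missing.
toy = pd.DataFrame({
    "GarageArea": [480.0, np.nan, 300.0],
    "GarageCars": [2.0, np.nan, 1.0],
    "GarageType": ["Attchd", np.nan, "Detchd"],
})

numeric_cols = ["GarageArea", "GarageCars"]
categorical_cols = ["GarageType"]

# Numeric garage fields: no garage means zero area and zero cars.
toy[numeric_cols] = toy[numeric_cols].fillna(0)
# Categorical garage fields: use an explicit 'None' label instead of NaN.
toy[categorical_cols] = toy[categorical_cols].fillna("None")

remaining_missing = int(toy.isnull().sum().sum())
print(toy)
print("missing values left:", remaining_missing)
```

Filling with 0 and 'None' keeps "no garage" as information the model can use, instead of dropping those rows or blending them into an average.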

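Other features are imputed by grouping the rows on a related feature and filling in the mode of each group. Here transform performs a group-wise operation: it applies the function to each groupby group and returns a result aligned with the original rows, so every missing MSZoning value receives the most frequent MSZoning among houses with the same MSSubClass.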

features['MSZoning'] = features.groupby('MSSubClass')['MSZoning'].transform(lambda x: x.fillna(x.mode()[0]))
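
To see what the group-wise fill does, here is a small sketch on made-up data. The helper fill_with_group_mode and its guard for groups with no observed value are assumptions added for illustration; the one-liner above would raise an error on such a group because x.mode() would be empty.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two MSSubClass groups, each with one missing MSZoning.
toy = pd.DataFrame({
    "MSSubClass": [20, 20, 20, 60, 60, 60],
    "MSZoning": ["RL", "RL", np.nan, "RM", np.nan, "RM"],
})


def fill_with_group_mode(group):
    """Fill missing entries of one group with that group's most frequent value."""
    modes = group.mode()  # mode() ignores NaN by default
    # Assumed guard: a group with no observed value is returned unchanged.
    if modes.empty:
        return group
    return group.fillna(modes[0])


toy["MSZoning"] = toy.groupby("MSSubClass")["MSZoning"].transform(fill_with_group_mode)
filled = toy["MSZoning"].tolist()
print(filled)
```

Each missing entry takes the mode of its own MSSubClass group, not the mode of the whole column.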

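All remaining categorical variables are filled with the string 'None', and we record the names of the object-dtype columns that this fill is applied to.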

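# Collect the names of all object-dtype (categorical) columns
objects = []
for i in features.columns:
    if features[i].dtype == object:
        objects.append(i)
# Fill whatever is still missing in those columns with the string 'None'
features.update(features[objects].fillna('None'))
print(objects)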
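
The loop above lists every object-dtype column, whether or not it had missing values. If you only want the columns that were actually imputed, one option is to count the missing values before filling. This is a sketch on made-up data, not the author's method; toy, object_cols and imputed_cols are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; dtype=object is set explicitly so the example
# behaves the same regardless of the default string dtype in use.
toy = pd.DataFrame({
    "Fence": pd.Series(["MnPrv", np.nan, "GdWo"], dtype=object),
    "PoolQC": pd.Series([np.nan, np.nan, "Ex"], dtype=object),
    "Street": pd.Series(["Pave", "Pave", "Grvl"], dtype=object),
    "LotArea": [8450, 9600, 11250],
})

# All object-dtype columns, found without an explicit loop.
object_cols = toy.select_dtypes(include=["object"]).columns.tolist()

# Record which of them contain missing values BEFORE filling.
missing_counts = toy[object_cols].isnull().sum()
imputed_cols = missing_counts[missing_counts > 0].index.tolist()

toy[object_cols] = toy[object_cols].fillna("None")
print("object columns:", object_cols)
print("columns that were imputed:", imputed_cols)
```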

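Numeric variables have to be filled as well so that no missing values remain. LotFrontage is filled with the median of its Neighborhood, and every other numeric column falls back to 0.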

features['LotFrontage'] = features.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))

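# Collect the names of all numeric columns
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerics = []
for i in features.columns:
    if features[i].dtype in numeric_dtypes:
        numerics.append(i)
# Fill whatever is still missing in the numeric columns with 0
features.update(features[numerics].fillna(0))
# Show a few of the numeric column names
numerics[1:10]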
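
As a final check, the two numeric steps can be run on a small table and the remaining missing values counted. This is a minimal sketch on made-up data; it uses select_dtypes(include="number") as an assumed shortcut for the dtype loop above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing numeric values.
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown"],
    "LotFrontage": [60.0, np.nan, 80.0, 50.0, np.nan],
    "MasVnrArea": [np.nan, 120.0, 0.0, np.nan, 30.0],
})

# Step 1: fill LotFrontage with the median of the same neighborhood.
toy["LotFrontage"] = (
    toy.groupby("Neighborhood")["LotFrontage"]
    .transform(lambda s: s.fillna(s.median()))
)

# Step 2: every other numeric gap falls back to 0.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
toy[numeric_cols] = toy[numeric_cols].fillna(0)

# Step 3: confirm that nothing is left to impute.
total_missing = int(toy.isnull().sum().sum())
print(toy)
print("missing values left:", total_missing)
```

The order matters: if the blanket fill with 0 ran first, LotFrontage would no longer have any missing values for the neighborhood median to replace.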